Kohonen Maps¶

Kohonen maps are an interesting unsupervised learning algorithm used to cluster a dataset. I will not go into much detail about the technique here, because below you will find an explanation of how the algorithm works.

Reading the DataSet¶

[1]:

import pandas as pd

data_path = './ifood-data-business-analyst-test/ml_project1_data.csv'
dataset = pd.read_csv(data_path)

Preprocessing¶

Note that this preprocessing is very close to the one presented earlier for the XGBoost technique, so if you want to skip ahead to the model part, please go for it.

Here we apply some simple preprocessing to the data: removing possibly non-informative data, creating fields that are more suitable for interpretation, encoding some of the features (since some are categorical), and normalizing the data so that features with large ranges are not over-weighted, and so on…

Notice that most of the preprocessing functions used here are implemented in a separate module, since they can be reused for other models and for later analysis.

[2]:

from utils import *

Pipeline¶

The preprocessing pipeline, the same one used for the XGBoost classification algorithm, is as follows:

Step #1 First we replace some fields with more interpretable information (Birth date => Age, Customer Registration => Persistence, …)
Step #2 Then we replace the categorical data with an encoded version (categorical variables => numerical variables)
Step #3 Then some non-informative features are dropped from the analysis, e.g. features that are constant across all samples (and therefore do not provide any information)
Step #4 Since only 24 samples have NaN (or null) values, we can drop those from the dataset instead of worrying about interpolation and so on…

[3]:

dataset = support.replaceFields(dataset)                # Step #1

dataset, encoders = support.encodeDataSet(dataset)      # Step #2

dataset = support.dropNonInformative(dataset)           # Step #3

df = dataset.dropna()                                   # Step #4

Features dropped: ['Z_CostContact', 'Z_Revenue']
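The `support` helpers live in a separate module that is not shown here, so the following is only a rough sketch of what the four steps might do, written with plain pandas on a toy frame. Apart from `Z_CostContact`, the column names, the reference year and the use of `pd.factorize` are assumptions made for illustration, not the actual implementation.

```python
import pandas as pd

# Toy frame; apart from Z_CostContact the column names are illustrative guesses.
toy = pd.DataFrame({
    "Year_Birth": [1970, 1985, 1990, 1962],
    "Education": ["PhD", "Master", "PhD", "Basic"],
    "Income": [58000.0, None, 42000.0, 71000.0],
    "Z_CostContact": [3, 3, 3, 3],  # constant, so it carries no information
})

reference_year = 2020  # assumed reference year for computing the age

# Step 1: replace a raw field with a more interpretable one
toy["Age"] = reference_year - toy["Year_Birth"]
toy = toy.drop(columns=["Year_Birth"])

# Step 2: encode categorical columns as integer codes, keeping the mapping
categorical_cols = ["Education"]
encoders = {}
for col in categorical_cols:
    codes, categories = pd.factorize(toy[col])
    toy[col] = codes
    encoders[col] = list(categories)

# Step 3: drop columns that hold a single value in every row
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
toy = toy.drop(columns=constant_cols)

# Step 4: drop rows that still contain missing values
toy_clean = toy.dropna()
```

Keeping the `encoders` mapping around makes it possible to translate the integer codes back to the original categories when interpreting the results later.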

Here we apply a simple treatment to the data: some normalization and then some balancing of the dataset, the same way as discussed in the XGBoost section, where there are more details. For now, we put the data in a regression format, with regressors (phi) and target (the output). After that we normalize each feature to the range between zero and one, and finally the dataset is balanced so that it has the same number of 1 outputs as 0 outputs.

[4]:

import numpy as np

# Create the regression format
phi = df.loc[:, ~df.columns.isin(['Response', 'ID'])].to_numpy()
target = df["Response"].to_numpy()

# Normalization (min-max scaling of each feature to [0, 1])
max_vals = np.amax(phi, axis=0)
min_vals = np.amin(phi, axis=0)
phi_n = (phi - min_vals) / (max_vals - min_vals)

# Balancing the data
X, y = support.balanceDataSet(phi_n, target)
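The code of `support.balanceDataSet` is not shown in this section. As a minimal sketch, assuming it balances by randomly undersampling the majority class (it may just as well oversample the minority class), the scaling and balancing could look like the following. The function name `undersample` and the toy arrays are hypothetical.

```python
import numpy as np

def undersample(features, labels, seed=0):
    """Randomly drop majority-class rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    idx_pos = np.flatnonzero(labels == 1)
    idx_neg = np.flatnonzero(labels == 0)
    n = min(len(idx_pos), len(idx_neg))
    keep = np.concatenate([
        rng.choice(idx_pos, size=n, replace=False),
        rng.choice(idx_neg, size=n, replace=False),
    ])
    rng.shuffle(keep)  # avoid having all 1s followed by all 0s
    return features[keep], labels[keep]

# Toy data: 8 negatives and 2 positives
toy_phi = np.arange(30, dtype=float).reshape(10, 3)
toy_target = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

# Min-max scaling to [0, 1], column by column
col_min, col_max = toy_phi.min(axis=0), toy_phi.max(axis=0)
toy_phi_n = (toy_phi - col_min) / (col_max - col_min)

X_bal, y_bal = undersample(toy_phi_n, toy_target)
```

Whatever strategy is used, the labels returned together with the balanced features are the ones that must be used from here on, since the row order no longer matches the original `target`.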

The Kohonen Map¶

This model is commonly known as a SOM (Self-Organizing Map), but it was created by Kohonen and is known among mathematicians as a Kohonen map. It is an unsupervised learning algorithm that builds a weight image from the input features. The main idea is that the model adjusts itself to map any \(\mathbb{R}^{n_x \times 1}\) input vector onto an \(n_i \times n_i\) grid, that is, an \(\mathbb{R}^{n_i \times n_i}\) image, where \(n_x, n_i \in \mathbb{N}\).

A simpler way to understand it is that this algorithm takes a sample vector and transforms it into an image! This is one of the most interesting techniques for clustering images… this algorithm was the first one used by the Detran in Brazil to classify the letters of a vehicle plate! I know that because I know the guys that created it for Detran, hehe. It is pretty powerful since it measures the pattern information at each point, and it is often able to classify data by itself, without anything more. So here we will try this approach to see whether the data has patterns distinct enough that, even without the labels, we would be able to classify the customers.
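To make the idea of a vector landing on a grid more concrete, here is a simplified NumPy sketch of the classic online SOM update: find the best matching unit, then pull it and its grid neighbours towards the sample. It is not MiniSom's implementation; the function names, the linear decay schedule and all default parameters are assumptions chosen for illustration.

```python
import numpy as np

def train_som(samples, grid=5, n_iter=500, sigma=1.5, lr=0.5, seed=0):
    """Train a grid x grid map on samples of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    n_features = samples.shape[1]
    weights = rng.normal(size=(grid, grid, n_features))
    # Grid coordinates of every node, shape (grid, grid, 2)
    rows, cols = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    coords = np.stack([rows, cols], axis=-1)
    for t in range(n_iter):
        x = samples[rng.integers(len(samples))]
        # Best matching unit: the node whose weight vector is closest to x
        dist = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dist), dist.shape)
        # Linear decay of learning rate and neighbourhood radius (an assumption)
        decay = 1.0 - t / n_iter
        # Gaussian neighbourhood measured on the grid, centred on the BMU
        grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2 * (sigma * decay + 1e-3) ** 2))
        # Move every node towards x, weighted by its closeness to the BMU
        weights += (lr * decay) * h[..., None] * (x - weights)
    return weights

def winner(weights, x):
    """Grid position (row, col) of the node closest to x."""
    dist = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dist), dist.shape)

# Toy data: two well separated clusters in 4 dimensions
rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=-2.0, scale=0.3, size=(50, 4))
cluster_b = rng.normal(loc=2.0, scale=0.3, size=(50, 4))
toy = np.vstack([cluster_a, cluster_b])

weights = train_som(toy)
cells_a = {winner(weights, x) for x in cluster_a}
cells_b = {winner(weights, x) for x in cluster_b}
```

With data this well separated one would expect `cells_a` and `cells_b` to occupy different regions of the grid; the customer data below turns out to be far less clean.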

[5]:

from minisom import MiniSom
import matplotlib.pyplot as plt

# Normalize the dataset
data = X - np.mean(phi_n, axis=0)
data /= np.std(data)

# Initialization and training
som_res = 7 # The image resolution
# The input length must match the number of features (25 in this dataset)
som = MiniSom(som_res, som_res, data.shape[1], sigma=2., learning_rate=.5, random_seed=100)

# Initialize the net weights
som.pca_weights_init(data)

# Train the model
som.train_batch(data, 100000, verbose=True)

# Plot the weight image
plt.figure(figsize=(7, 7))

# Plotting the response for each sample in the balanced customer dataset
plt.pcolor(som.distance_map().T, cmap='bone_r')  # plotting the distance map as background
plt.colorbar()

# Use different colors and markers for each label
markers, colors = ['o', 's'], ['C0', 'C1']
for cnt, xx in enumerate(data):
    w = som.winner(xx)  # getting the winner
    # place a marker on the winning position for the sample xx
    label = int(y[cnt])  # y holds the labels of the balanced set X that data was built from
    plt.plot(w[0]+.5, w[1]+.5, markers[label], markerfacecolor='None',
             markeredgecolor=colors[label], markersize=12, markeredgewidth=2)

plt.show()

 [ 100000 / 100000 ] 100% - 0:00:00 left
 quantization error: 3.376497223442264

Notice that the orange squares and the blue circles represent the labels 1 and 0, respectively. The perfect solution here would be to have no circle overlapping a square, and vice versa; if you want to see an ideal classification using self-organizing maps, please check out this link. Here we cannot separate the data correctly, so it is not clear that the model without guidance (unsupervised learning) is able to tell the classes apart. Of course there are some circles and squares that stand alone, and those provide a consistent representation of 0 and 1 outputs, respectively.

Surely this model could be used in the opposite way, to rule a class out… for example, if a sample passes through the self-organizing map and lands on a cell where there are only circles, it is very unlikely to be a 1 output, and vice versa. But it is probably too much work for only a small improvement in the final performance.
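As a small sketch of that idea, assuming the winning cell of every training sample has already been computed (for instance with `som.winner`), one could collect the cells that only ever received a single class and use them as a filter. The helper names and the toy winners below are hypothetical.

```python
from collections import Counter, defaultdict

def pure_cells(winners, labels):
    """Map each grid cell that received samples of a single class to that class."""
    counts = defaultdict(Counter)
    for cell, label in zip(winners, labels):
        counts[tuple(cell)][label] += 1
    return {cell: next(iter(c)) for cell, c in counts.items() if len(c) == 1}

def predict_if_pure(cell, pure):
    """Return the class of a pure cell, or None when the cell is mixed or unseen."""
    return pure.get(tuple(cell))

# Hypothetical winner coordinates and labels for six training samples
toy_winners = [(0, 0), (0, 0), (1, 2), (1, 2), (3, 3), (3, 3)]
toy_labels = [0, 0, 0, 1, 1, 1]

pure = pure_cells(toy_winners, toy_labels)
```

Samples that land on a mixed cell get `None` and would still need another model, which is why the gain from this filter is likely to be small.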
